Abstract


The crucial role played by interpretability in many practical scenarios has led a large part of machine learning research towards the development of interpretable approaches. In this work, we present a game-theory-based method capable of achieving state-of-the-art accuracy while keeping the focus on the interpretability of the predictions. The proposed approach is an instance of the more general preference learning framework. By design, the method identifies the most relevant features even when dealing with high-dimensional problems. This is possible thanks to an online feature generation mechanism. Moreover, the algorithm is shown to be theoretically well-founded, thanks to a game-theoretical analysis of its convergence. To assess the quality of the proposed approach, we compare it against state-of-the-art methods in a wide range of classification settings. The experimental evaluation focuses on interpretability, with an in-depth analysis of visualization, feature selection and explainability.

1. Introduction


Game theory (GT) and computer science have always had a strong bond. Much of the current research in GT dates back to the work of computer science pioneers such as John von Neumann and Alan Turing. In its early days, artificial intelligence research made extensive use of games as training environments for the development of novel algorithms. Game-theoretical concepts are at the core of many machine learning (ML) approaches: reinforcement learning and imitation learning are two of many possible examples. However, adversarial learning is by far the hottest ML topic related to GT, since its competitive nature makes it a natural learning framework for application areas such as cyber security [2].

Historically, adversarial concepts are also at the basis of the seminal work that introduced AdaBoost [3], in which GT is applied to on-line learning. The GT-ML connection has also been extensively studied in the context of Support Vector Machines (SVMs). For instance, it can be shown that the hard-margin SVM can be cast as a two-player zero-sum game [4] [5].

In this paper, we present a principled algorithm, dubbed PRL (Preference and Rule Learning), inspired by preference learning [6] and game theory. PRL aims at maximizing the minimum margin in the space of preferences represented using Kessler's construction [7]. In PRL, the learning problem is cast as a two-player zero-sum game, where one player tries to select hypotheses that maximize the margin, and the opponent chooses adversarial preferences in order to minimize it. The considered hypothesis spaces consist of a set of preference prototypes along with (possibly non-linear) features. One important characteristic of PRL is that, by design, feature selection is an integral part of the learning process.

To deal with high-dimensional data and high-dimensional feature spaces, PRL generates features in an on-line fashion. As we will show later, on-line feature generation plays a very important role in PRL, especially when it comes to interpreting the solution of the model, which is very useful for producing explanations.

Nowadays, machine learning methods are widely used by non-practitioners, and the ability to interpret their models is often desirable. There are plenty of applications in which explanation plays a key role, such as bioinformatics applications, recommender systems, and support systems for physicians, just to mention a few. Moreover, the notion of explainability of automatic systems is also one of the most controversial subjects contained in the recently introduced European regulation GDPR (General Data Protection Regulation). PRL offers the ability to design feature spaces composed of logical rules and, thanks to its feature selection capabilities, gives the opportunity to interpret the provided prediction.


To summarize, the main contributions of this work are the following:

• a new large margin method based on preference learning and game-theoretical concepts for label ranking/multi-class classification. The method naturally comes with feature selection capabilities. PRL is also able to deal with (non-linear) feature spaces of infinitely many dimensions, thanks to the online feature generation;

• a framework general enough to deal with different kinds of features and rules, which are very useful when interpretability is desired;

• a theoretical study of the convergence to the optimal solution;

• a parallelized version of Fictitious Play [8] for solving game matrices;

• an extensive set of experiments, performed with the aim of assessing different aspects of PRL: (i) effectiveness, (ii) feature selection capability, and (iii) interpretability. Results show that PRL is able to provide sparse solutions that are suitable when an explanation of the decision is desirable.

The work presented here extends the paper [9]; in particular:

• we propose a parallelized version of the Fictitious Play algorithm for solving game matrices and demonstrate its convergence (Section 3);

• we introduce the decision path rules generation scheme (Section 6.3);

• we also present a dynamic budget version of PRL which automatically changes the columns' budget when needed;

• we add more details and experiments regarding the bias feature, showing its usefulness in producing sparser solutions;

• we integrate the evaluation with additional experiments, e.g., using the decision path rules generation scheme (Section 7).


The remainder of this paper is structured as follows: Section 2 introduces the background knowledge needed to understand the theoretical concepts of the paper. Section 3 presents Parallel Fictitious Play. Sections 4 and 5 present the main contribution of the paper, that is, the PRL method. In Section 6 the on-line feature generation used in PRL is presented, together with different feature generation schemes. Section 7 is dedicated to the evaluation of the proposed approach. Section 8 discusses works related to PRL, especially connections between machine learning and game theory, as well as other on-line non-linear feature selection approaches. Finally, Section 9 wraps up the contribution of the paper and discusses possible future work.


2. Background 


In this section we present the notions necessary to understand the rest of the paper. In particular, we introduce both Preference Learning (Section 2.1) and Game Theory (Section 2.2), focusing on the key elements used throughout the paper. Finally, in Section 3 we present a parallel version of the classical Fictitious Play algorithm, which is described in Sections 2.3 and 3.


2.1. Preference Learning 


Preference learning (PL) is a sub-task of machine learning in which the input data consists of preference relations. Such preferences are assumed to be in agreement with some utility function $g_\theta$. In PL problems, the goal is to build a preference model, i.e., to find the parameters $\theta$ of the utility function $g$ able to predict preferences for previously unseen items. In the context of label ranking, for instance, the training set consists of a set of pairwise preferences $y_i \succ_{\mathbf{x}} y_j$, $i \neq j$, that is, for the pattern $\mathbf{x}$, label $y_i$ is preferred to label $y_j$.

In this work we consider one of the main PL tasks, that is, label ranking [6]: given a set of input patterns $\mathbf{x}_i \in \mathcal{X}$, $i \in [1, \dots, n]$, and a finite set of labels $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$, the goal is to learn a scoring function $g_\theta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which assigns a score to each instance-label pair $(\mathbf{x}, y)$. It is worth noting that label ranking represents a generalization of a classification task, since $g_\theta$ implicitly defines, for an instance $\mathbf{x}$, a total order over $\mathcal{Y}$. We focus on linear preference models of the form $g_\theta(\mathbf{x}, y) = \mathbf{w}^\top \psi(\mathbf{x}, y)$, where $\theta = \mathbf{w} \in \mathbb{R}^{d \cdot m}$ is the vector of model parameters, and $\psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{d \cdot m}$, $\mathcal{X} \equiv \mathbb{R}^d$, $\mathcal{Y} \equiv \{1, \dots, m\}$, is a joint representation of instance and label pairs.

Given an item, the goal of the model $g_\theta$ is to correctly rank the labels according to the preferences, that is, given a preference $y_i \succ_{\mathbf{x}} y_j$, then $g_\theta(\mathbf{x}, y_i) > g_\theta(\mathbf{x}, y_j)$ should hold, and thus

$$\mathbf{w}^\top \psi(\mathbf{x}, y_i) > \mathbf{w}^\top \psi(\mathbf{x}, y_j) \;\Rightarrow\; \mathbf{w}^\top \left(\psi(\mathbf{x}, y_i) - \psi(\mathbf{x}, y_j)\right) > 0. \qquad (1)$$

Equation (1) can be interpreted as the margin (or confidence) of the preference $y_i \succ_{\mathbf{x}} y_j$ and, intuitively, large margins on training instances lead to good generalization capability of the ranker [12].

The instance-label joint representation used in PRL is based on Kessler's construction for multi-class classification [13] [14] [15] [7]. Kessler's construction is a very powerful tool for reducing learning problems [6]: through an appropriate representation of the instances, it allows multi-class problems to be solved using a single linear function instead of decomposing them into many binary sub-problems. Kessler's construction can be formalized as follows: given an instance (possibly) embedded in a feature space, i.e., $\phi(\mathbf{x})$, associated with label $y$, we define the instance-label representation $\psi$ as

$$\psi(\phi(\mathbf{x}), y) = \mathbf{e}_y^m \otimes \phi(\mathbf{x}) = (\mathbf{0}; \dots; \mathbf{0}; \underbrace{\phi(\mathbf{x})}_{y}; \mathbf{0}; \dots; \mathbf{0}),$$

where the symbol $\otimes$ indicates the Kronecker product, $\mathbf{e}_y^m$ is the $y$-th canonical basis vector of $\mathbb{R}^m$, and $\mathbf{0}$ are $d$-dimensional zero vectors. Therefore, given a preference $y_i \succ_{\mathbf{x}} y_j$, we can construct its corresponding embedding $\mathbf{z} \in \mathbb{R}^{d \cdot m}$ as

$$\mathbf{z} = \psi(\phi(\mathbf{x}), y_i) - \psi(\phi(\mathbf{x}), y_j) = (\mathbf{e}_{y_i}^m - \mathbf{e}_{y_j}^m) \otimes \phi(\mathbf{x}) = (\mathbf{0}; \dots; \underbrace{\phi(\mathbf{x})}_{y_i}; \mathbf{0}; \dots; \underbrace{-\phi(\mathbf{x})}_{y_j}; \mathbf{0}; \dots; \mathbf{0}) \in \mathbb{R}^{d \cdot m}.$$


With this definition of a preference $\mathbf{z}$, we can rewrite the margin (Equation (1)) as $\mathbf{w}^\top \mathbf{z}$. Note that, if $\mathbf{x}$ is defined directly in the input space, $\phi$ corresponds to the identity function. At prediction time, given a new instance $\mathbf{x}_{new}$, labels are ranked according to the score $g_{\mathbf{w}}(\phi(\mathbf{x}_{new}), y)$, $\forall y \in \mathcal{Y}$. In case of classification, the predicted label for $\mathbf{x}_{new}$ is the one that maximizes the achieved score, that is,

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} g_{\mathbf{w}}(\phi(\mathbf{x}_{new}), y).$$
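The construction above can be illustrated with a minimal sketch in plain Python. The function names (`kessler`, `preference_embedding`, `predict`) and the 0-based label indexing are assumptions made for this example and do not come from the paper.

```python
def kessler(phi_x, y, m):
    """Place phi_x in the y-th block (0-based) of a vector made of m blocks."""
    d = len(phi_x)
    psi = [0.0] * (d * m)
    psi[y * d:(y + 1) * d] = phi_x
    return psi


def preference_embedding(phi_x, y_pos, y_neg, m):
    """z = psi(phi(x), y_pos) - psi(phi(x), y_neg) for the preference y_pos > y_neg."""
    a = kessler(phi_x, y_pos, m)
    b = kessler(phi_x, y_neg, m)
    return [u - v for u, v in zip(a, b)]


def predict(w, phi_x, m):
    """Return the label whose score w . psi(phi(x), y) is the highest."""
    scores = [
        sum(wi * pi for wi, pi in zip(w, kessler(phi_x, y, m)))
        for y in range(m)
    ]
    return max(range(m), key=scores.__getitem__)
```

With `phi_x = [1.0, 2.0]` and three labels, the preference "label 0 over label 2" is embedded with `phi_x` in the first block and `-phi_x` in the last one, and the margin of a hypothesis `w` on that preference is simply the dot product of `w` with this vector.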


2.2. Game Theory 


Game theory is a branch of mathematics that studies the behaviour of rational game players who are trying to maximize their utility. For the purposes of this work, we focus on two-player zero-sum games, which are by definition non-cooperative games.

The strategic form of a two-player zero-sum game is defined by a triplet $(P, Q, M)$, where $P$ and $Q$ are finite non-empty sets of (pure) strategies for players P and Q, respectively, and $M : P \times Q \to \mathbb{R}$ is a function that associates a value $M(i, j)$ to each pair of pure strategies $(i, j)$ s.t. $i \in P$ and $j \in Q$. Since $P$ and $Q$ are finite sets, the function $M$ can be represented as a matrix $\mathbf{M} \in \mathbb{R}^{|P| \times |Q|}$, dubbed payoff matrix (or game matrix), such that $\mathbf{M}_{i,j} = M(i, j)$, where $|P|$ and $|Q|$ are the numbers of available pure strategies for P and Q, respectively. Each matrix entry $\mathbf{M}_{i,j}$ represents the loss of P, or equivalently the payoff of Q, when the strategies $i$ and $j$ are simultaneously played by the two players.

The game is held in rounds. At each round, the row player P and the column player Q play simultaneously: P picks a row, while Q picks a column of $\mathbf{M} \in \mathbb{R}^{|P| \times |Q|}$. The corresponding entry in $\mathbf{M}$ is the loss incurred by P or, equivalently, the payoff of Q. Clearly, the two players have opposite goals: player P wants to find a strategy that minimizes its loss, while player Q aims at defining a strategy that maximizes its payoff. Typically, the players' strategies are randomized over the rows/columns of the game matrix, that is, player P selects a row according to a probability distribution $\mathbf{p}$ over the rows and, similarly, player Q selects a column according to a probability distribution $\mathbf{q}$ over the columns. Strategies of this type are called mixed strategies, and they are typically represented as stochastic vectors, i.e., $\mathbf{p} \in \mathcal{S}_P$ and $\mathbf{q} \in \mathcal{S}_Q$, respectively, where $\mathcal{S}_P = \{\mathbf{p} \in \mathbb{R}_+^{|P|} \mid \|\mathbf{p}\|_1 = 1\}$ and $\mathcal{S}_Q = \{\mathbf{q} \in \mathbb{R}_+^{|Q|} \mid \|\mathbf{q}\|_1 = 1\}$.

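It is well known that the pair of optimal strategies $(\mathbf{p}^*, \mathbf{q}^*)$, that is, the saddle-point (or Nash equilibrium) of $\mathbf{M}$, can be computed by

$$V^* = \mathbf{p}^{*\top} \mathbf{M} \mathbf{q}^* = \min_{\mathbf{p}} \max_{\mathbf{q}} \mathbf{p}^\top \mathbf{M} \mathbf{q} = \max_{\mathbf{q}} \min_{\mathbf{p}} \mathbf{p}^\top \mathbf{M} \mathbf{q}, \qquad (2)$$

where $V^*$ is known as the value of the game. The saddle-point solution of Equation (2) can be found in polynomial time using linear programming.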
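As a sketch of why linear programming applies here (this is the standard textbook formulation, not a formula taken from the paper), the row player's side of Equation (2) can be written with an auxiliary variable $v$ that bounds the loss against every pure strategy of the column player:

```latex
\begin{aligned}
\min_{\mathbf{p},\, v} \quad & v \\
\text{s.t.} \quad & \textstyle\sum_{i} p_i \, \mathbf{M}_{i,j} \le v \qquad \forall j \in Q, \\
& \textstyle\sum_{i} p_i = 1, \qquad p_i \ge 0 \quad \forall i \in P.
\end{aligned}
```

At the optimum, $v$ equals the value of the game $V^*$, and the column player's optimal strategy is obtained from the dual program.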


2.3. Approximating the solution 


From a computational point of view, solving high dimensional game matrices through 
linear programming can become prohibitive. A possible way for addressing this com- 
putational issue is to rely on approximated solutions. There is a large body of research 
in the game theory community which deals with the problem of approximating the 
value of the game for huge game matrices. 

Freund et al. proposed an adaptive approach to compute an approximate saddle-point strategy using multiplicative weights. This algorithm, dubbed Adaptive Multiplicative Weights (AMW), is guaranteed to come close to the minimum loss achievable by any fixed strategy. An incremental version of AMW, called i-AMW, has recently been proposed by Bopardikar et al. [19]. The same authors previously presented a randomized approach in which each player chooses its best mixed strategy on a sampled set of rows/columns, that is, on a submatrix of the whole payoff matrix. The authors showed that, with sufficiently large submatrices, there exist probabilistic guarantees on the quality of the approximation.

In this work we rely on the Fictitious Play (FP) algorithm (a.k.a. Brown-Robinson learning process), which is one of the first methods proposed in the literature for approximating the solution of a game. We opt for FP because of its simplicity and its efficiency w.r.t. other approaches like AMW.

FP is a greedy approach that works as follows: a player picks an initial random pure strategy; then, in turn, each player picks its next pure strategy as the best response, assuming that the opponent picks at random according to the distribution defined by its previous choices. In other words, at each round both players try to infer the opponent's mixed strategy on the basis of its previous selections. The pseudo-code of FictPlay is reported in Algorithm 1.


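Algorithm 1: FictPlay: Fictitious Play algorithm
Input: $\mathbf{M} \in \mathbb{R}^{P \times Q}$: matrix game,
       $T_e$: number of iterations
Output: $\mathbf{p}, \mathbf{q}$: row/column player strategy,
        $V$: the value of the game

$r \leftarrow \mathrm{randint}(1, P)$
$\mathbf{s}_p, \mathbf{v}_p \leftarrow \mathbf{0}, \mathbf{0}$
$\mathbf{s}_q, \mathbf{v}_q \leftarrow \mathbf{M}_{r,:}^\top, \mathbf{0}$
for $t \leftarrow 1$ to $T_e$ do
    $c \leftarrow \arg\max \mathbf{s}_q$,  $\mathbf{s}_p \leftarrow \mathbf{s}_p + \mathbf{M}_{:,c}$
    $r \leftarrow \arg\min \mathbf{s}_p$,  $\mathbf{s}_q \leftarrow \mathbf{s}_q + \mathbf{M}_{r,:}^\top$
    $\mathbf{v}_q \leftarrow \mathbf{v}_q + \mathbf{e}_c^Q$;  $\mathbf{v}_p \leftarrow \mathbf{v}_p + \mathbf{e}_r^P$
end
$\mathbf{p} \leftarrow \mathbf{v}_p / \|\mathbf{v}_p\|_1$
$\mathbf{q} \leftarrow \mathbf{v}_q / \|\mathbf{v}_q\|_1$
$V \leftarrow \mathbf{p}^\top \mathbf{M} \mathbf{q}$
return $\mathbf{p}, \mathbf{q}, V$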


In Algorithm 1, $\mathbf{s}_p$ represents the unnormalized expected value when player Q plays according to $\mathbf{v}_q$. Analogous considerations are valid for $\mathbf{s}_q$. $\mathbf{M}_{r,:}$ and $\mathbf{M}_{:,c}$ indicate the $r$-th row and the $c$-th column of the matrix $\mathbf{M}$, respectively. Observe that, given the starting pure strategy for player P, Fictitious Play is deterministic: subsequent executions of FictPlay will produce the same strategies $\mathbf{p}$ and $\mathbf{q}$.
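A minimal pure-Python sketch of the loop in Algorithm 1 is given below. The function name `fict_play`, the optional `start_row` and `seed` arguments, and the tie-breaking rule (lowest index wins) are assumptions of this example; the matrix is a list of rows whose entries are the losses of the row player.

```python
import random


def fict_play(M, iters, start_row=None, seed=0):
    """Approximate the mixed strategies of a zero-sum matrix game by Fictitious Play."""
    n_rows, n_cols = len(M), len(M[0])
    rng = random.Random(seed)
    r = rng.randrange(n_rows) if start_row is None else start_row
    s_p = [0.0] * n_rows   # cumulative loss of each row against the columns played so far
    s_q = list(M[r])       # cumulative payoff of each column against the rows played so far
    v_p = [0] * n_rows     # how many times each row has been played
    v_q = [0] * n_cols     # how many times each column has been played
    for _ in range(iters):
        # Column player: best response (maximum payoff); ties go to the lowest index.
        c = max(range(n_cols), key=s_q.__getitem__)
        for i in range(n_rows):
            s_p[i] += M[i][c]
        # Row player: best response (minimum loss); ties go to the lowest index.
        r = min(range(n_rows), key=s_p.__getitem__)
        for j in range(n_cols):
            s_q[j] += M[r][j]
        v_q[c] += 1
        v_p[r] += 1
    p = [x / iters for x in v_p]
    q = [x / iters for x in v_q]
    value = sum(p[i] * M[i][j] * q[j] for i in range(n_rows) for j in range(n_cols))
    return p, q, value
```

On the matching-pennies matrix `[[1, -1], [-1, 1]]`, whose value is 0, the empirical frequencies returned by this sketch approach the uniform strategy for both players.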


Considering that the result of Fictitious Play is an approximation of the optimal strategies, it is necessary to define bounds that describe the quality of the approximation. In particular, given $\mathbf{q}_t$ and $\mathbf{p}_t$, the approximated strategies after $t$ iterations of Fictitious Play, the bounds are computed as

$$\underline{V} = \min \mathbf{M} \mathbf{q}_t \quad \text{and} \quad \overline{V} = \max \mathbf{p}_t^\top \mathbf{M}.$$

Namely, the lower bound $\underline{V}$ corresponds to the minimum loss that player P would incur playing its best pure strategy against the mixed strategy $\mathbf{q}_t$ for player Q. On the other hand, the upper bound $\overline{V}$ is the payoff achieved by Q playing its best pure strategy against the approximated mixed strategy $\mathbf{p}_t$. Subsequent iterations of Fictitious Play will lead (non-monotonically) to a better approximation of $\mathbf{q}^*$, and thus to higher lower bounds, until convergence is reached.

If the optimal strategy q* is found, all pure strategies for P that have weight in 
the optimal strategy p*, will lead to the very same upper bound. The same reasoning 
holds for player Q. Thus, as the Nash equilibrium requires, no player would benefit 
from changing unilaterally their own strategy. In such case, upper and lower bound are 


equal to the value of the game and convergence is reached. 
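The two bounds can be computed directly from a pair of mixed strategies. The following is a small sketch in plain Python; the function name `game_value_bounds` is an assumption of this example.

```python
def game_value_bounds(M, p, q):
    """Return (lower, upper) bounds on the game value for mixed strategies p and q.

    lower: loss of the best pure row against q, i.e. min over rows of (M q).
    upper: payoff of the best pure column against p, i.e. max over columns of (p^T M).
    """
    n_rows, n_cols = len(M), len(M[0])
    lower = min(sum(M[i][j] * q[j] for j in range(n_cols)) for i in range(n_rows))
    upper = max(sum(p[i] * M[i][j] for i in range(n_rows)) for j in range(n_cols))
    return lower, upper
```

For matching pennies, the uniform strategies give a gap of zero, while the pure strategies `p = q = [1, 0]` give the loosest possible bounds `(-1, 1)`; the gap `upper - lower` is the quantity plotted as solution quality in the next section.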


3. Parallel Fictitious Play 


The computational cost of approximating the optimal strategies for players P and Q using Fictitious Play is $O(T_e \cdot \max(|P|, |Q|))$. Since the actions chosen at time $t$ depend on $\mathbf{s}_p$ and $\mathbf{s}_q$, which are computed using previously selected pure strategies, Fictitious Play cannot be directly parallelized.

Empirically, the quality of the solution, computed as the gap between the bounds (see Figure 1(a)), has a fast initial drop toward the optimal strategies and a subsequent slow asymptotic convergence toward the real values.

To exploit the advantages of the first part of the search, without burdening the algorithm with the second part (slow and less fruitful), we propose a simple, yet effective, parallel search for the optimal strategies. This version of FictPlay, which will be referred to as Parallel FictPlay, consists in computing (possibly in a parallel fashion) different approximated strategies $\{\mathbf{p}_1^{(t)}, \dots, \mathbf{p}_k^{(t)}\}$ and $\{\mathbf{q}_1^{(t)}, \dots, \mathbf{q}_k^{(t)}\}$, and then averaging over the found strategies. Empirically, Figure 1(a) and Figure 1(b) show how Parallel FictPlay can achieve better performance, with tighter bounds on the approximated solutions. Given the low convergence rate as the bounds approach the real value of the game, Parallel FP can achieve bounds similar to those of FP in almost half of the iterations.

Figure 1: (a) Distance between the lower and upper bounds on the game value, using mixed strategies approximated either by a single run of the Fictitious Play algorithm or by averaging over multiple executions. The payoff matrix $\in \mathbb{R}^{958 \times (277\text{-}958)}$ is based on a polynomial version, in a preference learning setting, of the tic-tac-toe dataset and describes a zero-sum game, where the players can either win (+2), lose (-2) or have a tie (0). The value of the game is approximately 0.3. (b) Lower and upper bounds on the value of the game described by the same matrix used to build the plot in Figure 1(a).

Finally, it can be noted that, given the deterministic nature of Fictitious Play, to 
obtain different strategies using parallel FictPlay it is necessary, yet not sufficient, that 
the different executions of FictPlay have different starting points. Under this assump- 
tion, we can see parallel FictPlay as the parallelized version of sequential FictPlay 


with random restarts. 
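The averaging scheme can be sketched as follows. For simplicity the independent runs are executed one after the other, although each of them could be dispatched to a separate worker; the names `fict_play` and `parallel_fict_play`, the explicit list of starting rows, and the lowest-index tie-breaking are assumptions of this example.

```python
def fict_play(M, iters, start_row):
    """Compact Fictitious Play run from a given starting row (same loop as Algorithm 1)."""
    n_rows, n_cols = len(M), len(M[0])
    s_p, s_q = [0.0] * n_rows, list(M[start_row])
    v_p, v_q = [0] * n_rows, [0] * n_cols
    for _ in range(iters):
        c = max(range(n_cols), key=s_q.__getitem__)
        s_p = [s + M[i][c] for i, s in enumerate(s_p)]
        r = min(range(n_rows), key=s_p.__getitem__)
        s_q = [s + M[r][j] for j, s in enumerate(s_q)]
        v_q[c] += 1
        v_p[r] += 1
    return [x / iters for x in v_p], [x / iters for x in v_q]


def parallel_fict_play(M, iters, start_rows):
    """Average the strategies of independent runs that differ in their starting row."""
    runs = [fict_play(M, iters, r) for r in start_rows]  # independent of each other
    k = len(runs)
    p_bar = [sum(p[i] for p, _ in runs) / k for i in range(len(M))]
    q_bar = [sum(q[j] for _, q in runs) / k for j in range(len(M[0]))]
    return p_bar, q_bar
```

Since every run returns a probability vector, the averages are again valid mixed strategies, which is the property used in the proof of Theorem 1.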
Theorem 1. Parallel FictPlay converges to the optimal solution. 


Proof. We know that a single execution of FP converges to the optimal solution, so for an arbitrarily small $\epsilon > 0$ there exists a number of iterations $T$ such that, for $t > T$,

$$\forall i \in [1, P],\; p_i^* - \epsilon < p_i^{(t)} < p_i^* + \epsilon \quad \text{and} \quad \forall i \in [1, Q],\; q_i^* - \epsilon < q_i^{(t)} < q_i^* + \epsilon.$$

Given an execution of Parallel FP with $k$ FP runs, its strategy is the average of the strategies of the single FP runs. Thus, given $T$ and $\epsilon$ we have that

$$\bar{\mathbf{p}}^{(t)} = \frac{1}{k} \sum_{s=1}^{k} \mathbf{p}_s^{(t)}, \qquad \bar{\mathbf{q}}^{(t)} = \frac{1}{k} \sum_{s=1}^{k} \mathbf{q}_s^{(t)},$$

where $\bar{\mathbf{p}}^{(t)}$ and $\bar{\mathbf{q}}^{(t)}$ are the strategies of Parallel FP after $t > T$ iterations. Thus, we can bound the difference of $\bar{\mathbf{p}}^{(t)}$ and $\bar{\mathbf{q}}^{(t)}$ w.r.t. the optimal strategies, i.e.,

$$\forall i \in [1, P],\; p_i^* - \epsilon < \bar{p}_i^{(t)} < p_i^* + \epsilon \quad \text{and} \quad \forall i \in [1, Q],\; q_i^* - \epsilon < \bar{q}_i^{(t)} < q_i^* + \epsilon.$$

Since $\epsilon$ is a bound for all the single FP runs, we can also affirm that, in expectation, the Parallel FP bound on the strategies' entries is stricter than $\epsilon$. This is easy to show, since the bound is $\epsilon$ also for Parallel FP iff for all $s \in [1, k]$ there exists an entry in either $\mathbf{p}_s^{(t)}$ or $\mathbf{q}_s^{(t)}$ that differs from the corresponding entry in $\mathbf{p}^*$ or $\mathbf{q}^*$ by exactly $\pm\epsilon$. In all other cases, averaging over the strategies guarantees a stricter bound than $\epsilon$.


4. Preference Learning: a game theoretic perspective 


In this section we introduce the theoretical principles that underlie PRL. Throughout the section we assume a training set of $N$ preferences of the form $(y_+ \succ_{\mathbf{x}} y_-)$. Such preferences are converted into their corresponding vectorial representation using Kessler's construction as described in Section 2.1.

As mentioned previously, we consider a hypothesis space $\mathcal{H}$ composed of linear functions of preference representations, i.e., $\mathcal{H} = \{\mathbf{z} \mapsto \mathbf{w}^\top \mathbf{z} \mid \mathbf{w}, \mathbf{z} \in \mathbb{R}^{d \cdot m}\}$. Given a preference $\mathbf{z}$, we say that $\mathbf{z}$ is satisfied by a hypothesis $\mathbf{w}$ iff $\mathbf{w}^\top \mathbf{z} > 0$, that is, when the margin of the preference $\rho(\mathbf{z}) = \mathbf{w}^\top \mathbf{z}$ is strictly positive. The margin of a preference can be considered as the confidence of the hypothesis $\mathbf{w}$ over the preference $\mathbf{z}$.

PRL aims at finding the linear hypothesis $\mathbf{w}$ in the preference space that maximizes the minimum margin over the training preferences. According to the Representer Theorem, we know that the maximal margin hypothesis $\mathbf{w}$ can be defined as a convex combination of the training preferences, that is

$$\mathbf{w} = \sum_{j} \alpha_j \mathbf{z}_j, \quad \boldsymbol{\alpha} \in \mathcal{S}_P.$$

Hence, the margin of a preference $\mathbf{z}$ can be rewritten as

$$\rho(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} = \sum_{j=1}^{N} \alpha_j \mathbf{z}_j^\top \mathbf{z} = \sum_{j=1}^{N} \alpha_j \sum_{f \in \mathcal{F}} \mu_f \, \mathbf{z}_j[f]^\top \mathbf{z}[f] = \sum_{j=1}^{N} \sum_{f \in \mathcal{F}} \alpha_j \mu_f \, \mathbf{z}_j[f]^\top \mathbf{z}[f] = \sum_{(j,f)} q_{(j,f)} \, \mathbf{z}_j[f]^\top \mathbf{z}[f],$$

where the dot product $\mathbf{z}_j^\top \mathbf{z}$ is generalized by assigning weights to the features according to a distribution $\boldsymbol{\mu}$ over the features, and $\mathbf{q}$ is a new distribution over all the possible preference-feature pairs such that $q_{(j,f)} = \alpha_j \mu_f$. $\mathcal{F}$ is the (potentially infinite) set of (possibly non-linear) features and $\mathbf{z}[f]$ is the sub-vector of $\mathbf{z}$ corresponding to the $f$-th feature in Kessler's construction.

Now, let us assume that an adversary chooses a distribution $\mathbf{p}$ over the training preferences with the goal of minimizing the expected margin achieved by the hypothesis $\mathbf{w}$ on the training preferences. Given $\mathbf{p}$, the expected margin is defined by

$$\rho(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{N} p_i \sum_{(j,f)} q_{(j,f)} \, \mathbf{z}_i[f]^\top \mathbf{z}_j[f] = \mathbf{p}^\top \mathbf{M} \mathbf{q}, \qquad (3)$$

where $\mathbf{M}_{i, \pi(j,f)} = \mathbf{z}_i[f]^\top \mathbf{z}_j[f]$, with $\pi$ the function that maps preference-feature pairs onto unique indexes, i.e., $\pi(j, f) \in [1, N|\mathcal{F}|]$.

Eq. (3) is closely related to the two-player zero-sum game presented in Section 2.2. Specifically, consider a two-player zero-sum game where the row player P (the nature) picks a preference from a distribution over the whole set of training preferences (i.e., the rows), aiming at minimizing the expected margin $\rho$. Simultaneously, the opponent player Q (the learner) picks a column from a distribution over the set of preference-feature pairs (i.e., the columns), aiming at maximizing the expected margin (payoff). Then, the value of the game, that is, the maximal minimum margin solution, is given by

$$V^* = \rho(\mathbf{p}^*, \mathbf{q}^*) = \min_{\mathbf{p}} \max_{\mathbf{q}} \mathbf{p}^\top \mathbf{M} \mathbf{q},$$

which is exactly the same as Equation (2). In other words, searching for the distribution $\mathbf{q}$ maximizing the minimum margin in the training set is equivalent to finding the saddle-point solution of the game matrix $\mathbf{M}$.
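To make the structure of the game matrix concrete, the sketch below builds it for a handful of preferences and scalar features. It relies on the observation that, under Kessler's construction, the sub-vector $\mathbf{z}[f]$ only carries the value of feature $f$ in the blocks of the two labels involved, so each entry factorizes into a product of feature values and a label term. The data layout (a preference as a tuple `(x, y_pos, y_neg)`, a feature as a Python callable) and the function names are assumptions of this example.

```python
def label_dot(pref_a, pref_b):
    """Dot product (e_{y+} - e_{y-}) . (e_{y'+} - e_{y'-}) of the label parts of two preferences."""
    _, a_pos, a_neg = pref_a
    _, b_pos, b_neg = pref_b
    return (a_pos == b_pos) + (a_neg == b_neg) - (a_pos == b_neg) - (a_neg == b_pos)


def game_matrix(prefs, columns):
    """Build M with M[i][k] = z_i[f]^T z_j[f], where columns[k] = (j, f).

    prefs:   list of preferences (x, y_pos, y_neg), with y_pos != y_neg
    columns: list of (preference index j, feature function f) pairs
    """
    return [
        [f(prefs[i][0]) * f(prefs[j][0]) * label_dot(prefs[i], prefs[j])
         for (j, f) in columns]
        for i in range(len(prefs))
    ]
```

With binary features the entries produced by this sketch take values in {-2, 0, +2}, which matches the win/lose/tie payoffs mentioned in the caption of Figure 1.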


5. PRL: Preference and Rule Learning 


The number of columns of the game matrix $\mathbf{M}$ is equal to the number of all possible preference-feature pairs, i.e., $N|\mathcal{F}|$. In general, this number is huge, and solving the game using standard off-the-shelf algorithms from game theory is infeasible. Unfortunately, using approximated methods does not solve the issue, especially because the number of columns can potentially be infinite ($|\mathcal{F}| \to \infty$).

To overcome this problem, we propose a new incremental method for solving the 
game, that we call PRL (Preference and Rule Learning algorithm). The main idea 
behind PRL is to consider only a fraction of the columns of the whole game matrix. It- 
eratively, each sub-game is solved and the columns that do not contribute to the strategy 
of the column player are replaced by new randomly selected columns. 

Formally, let $\mathbf{M}$ be the game matrix and let $(\mathbf{p}^*, \mathbf{q}^*, V^*)$ be its corresponding optimal solution. At each iteration the algorithm considers a subset of columns of $\mathbf{M}$, that is, $\mathbf{M}_t = \mathbf{M} \boldsymbol{\Pi}_t$, where $\boldsymbol{\Pi}_t \in \{0, 1\}^{Q \times B}$ (with $Q$ the number of columns of $\mathbf{M}$) are left-stochastic (0,1)-matrices, i.e., matrices whose entries belong to the set $\{0, 1\}$ and whose columns add up to one. $B$, which we call the budget, is the number of columns considered at each iteration.

Let us now consider the solution $(\mathbf{p}_t^*, \mathbf{q}_t^*, V_t^*)$ of the matrix $\mathbf{M}_t$ computed at iteration $t$. At the end of each iteration, the algorithm replaces the columns of $\mathbf{M}_t$ corresponding to null entries in $\mathbf{q}_t^*$ (which do not contribute to the solution) with new columns randomly drawn from the whole set of available columns. In this setting, the following theorem holds.


Theorem 2. In PRL, at each iteration, the value of the game increases monotonically and it is upper bounded by the optimal margin, that is, the value of the game when considering the full matrix $\mathbf{M}$.


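Proof. Let us assume we are at iteration $t + 1$: a new left-stochastic (0,1)-matrix $\boldsymbol{\Pi}_{t+1}$ is taken into account, that is, $\boldsymbol{\Pi}_t$ where every column corresponding to a null entry in $\mathbf{q}_t^*$ has been substituted with a new random stochastic vector $\mathbf{e}$. Thus, it holds that

$$\begin{aligned}
V_t^* = \mathbf{p}_t^{*\top} \mathbf{M}_t \mathbf{q}_t^* &= \mathbf{p}_t^{*\top} \mathbf{M} \boldsymbol{\Pi}_t \mathbf{q}_t^* && (4) \\
&\le \mathbf{p}_{t+1}^{*\top} \mathbf{M} \boldsymbol{\Pi}_t \mathbf{q}_t^* && (5) \\
&= \mathbf{p}_{t+1}^{*\top} \mathbf{M} \boldsymbol{\Pi}_{t+1} \mathbf{q}_t^* && (6) \\
&\le \mathbf{p}_{t+1}^{*\top} \mathbf{M} \boldsymbol{\Pi}_{t+1} \mathbf{q}_{t+1}^* && (7) \\
&= \mathbf{p}_{t+1}^{*\top} \mathbf{M}_{t+1} \mathbf{q}_{t+1}^* = V_{t+1}^* && (8)
\end{aligned}$$

and

$$\forall t, \quad V_t^* = \mathbf{p}_t^{*\top} \mathbf{M} \boldsymbol{\Pi}_t \mathbf{q}_t^* \le \mathbf{p}^{*\top} \mathbf{M} \boldsymbol{\Pi}_t \mathbf{q}_t^* \le \mathbf{p}^{*\top} \mathbf{M} \mathbf{q}^* = V^*.$$

Equivalence (4) is trivial since $\mathbf{M}_t = \mathbf{M} \boldsymbol{\Pi}_t$ by definition. Inequality (5) holds because the strategy $\mathbf{p}_{t+1}^*$ is suboptimal for $\mathbf{M}_t$. In (6) we simply replaced the columns of the game matrix corresponding to null entries of $\mathbf{q}_t^*$, which does not affect the value. Finally, inequality (7) is true because $\mathbf{q}_t^*$ is suboptimal for $\mathbf{M}_{t+1}$, and similar considerations can be made for the last series of inequalities.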


The pseudo-code of the full algorithm is given in Algorithm 2.

It is worth noting that the algorithm does not require any prior knowledge about $\mathbf{M}$, and the number of columns can also be infinite. Thus, a natural approach for dealing with potentially infinite game matrices is to use an on-line column generation approach, as we will discuss in Section 6.

Figure 2 shows a visual overview of PRL. In the figure the three main phases of PRL are highlighted: (i) the learning phase takes the training preferences and the feature generator to produce the game matrix, which is incrementally solved as described in Section 5; (ii) the learned hypothesis is then used to make predictions for unseen preferences; and, when possible, (iii) the prediction is explained using the learned feature weights.

Algorithm 2: PRL: Preference and Rule Learning
Input: $\mathcal{P}$: set of training preferences
       $F_{gen}$: random feature generator
       $B$: size of the working set
       $T$: number of epochs
       $T_e$: number of iterations of FictPlay
Output: $\mathcal{Q}$: working set of hypotheses
        $\mathbf{q}$: mixed strategy in $\mathcal{Q}$

random initialization of the set $\mathcal{Q}$ such that $|\mathcal{Q}| = B$
compute the matrix game $\mathbf{M}$ on the basis of $\mathcal{P}$ (rows) and $\mathcal{Q}$ (cols)
for $t \leftarrow 1$ to $T$ do
    $\mathbf{p}, \mathbf{q}, v \leftarrow \mathrm{FictPlay}(\mathbf{M}, T_e)$
    if $t < T$ then
        foreach $(j, f) \mid q_{(j,f)} = 0$ do
            $(j', f') \leftarrow \mathrm{pick}(\mathcal{P}), F_{gen}()$
            update $\mathcal{Q}$: replace $(j, f)$ with $(j', f')$
            update columns of $\mathbf{M}$ w.r.t. $\mathcal{Q}$:
                let $k$ be the position of $(j', f')$ in $\mathcal{Q}$,
                for all $i \in \mathcal{P}$, $\mathbf{M}_{i,k} = \mathbf{z}_i[f']^\top \mathbf{z}_{j'}[f']$
        end
    end
end
return $\mathbf{q}, \mathcal{Q}$
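The column replacement step of Algorithm 2 (the inner `foreach` loop) can be sketched as follows. The function name `replace_null_columns`, the representation of a column as a `(preference index, feature)` pair, and the `feature_gen(rng)` callable standing in for the feature generator are all assumptions of this example; recomputing the corresponding entries of the game matrix is left out.

```python
import random


def replace_null_columns(columns, q, n_prefs, feature_gen, rng):
    """Redraw every column of the working set whose weight in q is exactly zero.

    columns:     list of (preference index, feature) pairs, one per column
    q:           mixed strategy of the column player over the working set
    n_prefs:     number of training preferences to draw from
    feature_gen: callable taking the random generator and returning a new feature
    Returns the new working set and the positions that were replaced.
    """
    new_columns, replaced = [], []
    for k, (col, weight) in enumerate(zip(columns, q)):
        if weight == 0:
            col = (rng.randrange(n_prefs), feature_gen(rng))
            replaced.append(k)
        new_columns.append(col)
    return new_columns, replaced
```

Only the positions listed in `replaced` need their matrix column recomputed, which keeps the cost of an epoch proportional to the number of discarded columns rather than to the whole budget.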


5.1. PRL with dynamic budget 


Generally speaking, there is no valid heuristic for setting the budget $B$ to a value that gives some guarantees. For this reason the hyper-parameter $B$ could be overestimated, leading to poor efficiency of the algorithm, or conversely underestimated, leading to weak solutions. To tackle this problem, we propose a variant of PRL with dynamic budget. The main difference w.r.t. the classical PRL is that $B$ now represents the minimum number of columns that are replaced at each iteration, and not the total number of available columns.

Figure 2: Visual overview of PRL. The black section represents the training phase with a "zoom" on the columns replacement policy; the blue section is the prediction phase that is necessary for the interpretation phase (red section), which uses the learned feature weights and the test preference to interpret the prediction.

Let us consider an example. Let $B = 10$ and let the initial number of columns of $\mathbf{M}$ be equal to 20. Let us assume that after iteration 1, there are 14 columns such that $q_{(j,f)} > 0$. Then, at the end of iteration 1, these 14 columns will be kept in $\mathbf{M}$ and another $B$ randomly generated columns will be added (actually, 6 columns are replaced and 4 added). In this way, the total number of columns of $\mathbf{M}$ at iteration 2 will be 24. The drawback of this technique is that the number of columns of $\mathbf{M}$ can potentially become very large. However, empirically (see Section 7), in all the performed experiments, $\mathbf{M}$ has always kept a reasonable number of columns (not much higher than the initial $B$). It is worth noting that all the theoretical properties of PRL are preserved for this variant of the algorithm.
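The bookkeeping of the example can be written down as a small helper. This is only one reading of the rule, under the assumption that all null columns are always replaced and that extra columns are appended only when fewer than $B$ columns are null; the function name `dynamic_budget_step` is hypothetical.

```python
def dynamic_budget_step(n_columns, n_active, budget):
    """Return (replaced, added, new_total) for one iteration of the dynamic-budget variant.

    n_columns: current number of columns of the game matrix
    n_active:  number of columns with non-zero weight in the current strategy
    budget:    minimum number of fresh columns generated per iteration (B)
    """
    n_null = n_columns - n_active
    n_replaced = n_null                   # every null column is redrawn
    n_added = max(0, budget - n_null)     # top up so that at least `budget` columns are new
    return n_replaced, n_added, n_active + n_replaced + n_added
```

With the numbers of the example, `dynamic_budget_step(20, 14, 10)` gives 6 replaced columns, 4 added columns and a total of 24 columns.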


6. On-line feature generation 


In PRL, on-line column generation is one of the most important components. As mentioned in the previous sections, each column of the game matrix is defined by a preference-feature pair. So, in order to generate a column, we need to independently draw a preference and a feature. Hence, it is crucial to define how the on-line feature generator works. We propose different feature generation schemes based on different types of features, specifically: polynomial features, decision rules, and decision tree paths.

In Algorithm 2, the function $F_{gen}$ refers to a generic feature generation scheme.


6.1. Polynomial features generation 

The polynomial feature generation scheme produces features that are taken from the feature space of the homogeneous polynomial kernel. For instance, given an $n$-dimensional instance $\mathbf{x}$, some possible polynomial features of degree 3 are $x_1 x_2 x_n$, $x_1^2 x_3$ and $x_2^3$. It is worth noting that, when the input variables are binary-valued, polynomial features are highly interpretable since monomials correspond to logical conjunctions, e.g., if $x_i \in \{0, 1\}$ then $x_1 x_2 x_n \equiv x_1 \wedge x_2 \wedge x_n$.
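A possible generator for such features is sketched below: it draws the variable indices of a monomial of the requested degree, with repetition, and returns the corresponding feature function. The name `random_monomial` and the uniform sampling of indices are assumptions of this example, not a description of the sampling used in the experiments.

```python
import random
from math import prod


def random_monomial(n_vars, degree, rng):
    """Draw a random monomial of the given degree over n_vars input variables.

    Indices are sampled with repetition, so powers such as x_i^2 x_j can occur.
    Returns the sorted index tuple and a function computing the monomial on x.
    """
    idx = tuple(sorted(rng.choices(range(n_vars), k=degree)))

    def feature(x):
        return prod(x[i] for i in idx)

    return idx, feature
```

On binary inputs the returned function evaluates to 1 only when all the selected variables are 1, i.e., it behaves as the logical conjunction of those variables.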


6.2. Rules generation 

When it comes to interpreting machine learning models, logical rules (as in decision trees) are the most natural choice. In order to introduce interpretable features in PRL, we propose a rule generation scheme. To generate rules, we must take into account the nature of the input variables. For example, in the case of binary-valued input variables, a rule is simply their truth value. However, when dealing with continuous variables, a rule can be defined as a relation involving the values of the variables, e.g., by defining a threshold like $x > 5$, or by checking the exact value such as $x = 3.2$. To generate this type of rule, the generator randomly picks a continuous feature $f$ and a random threshold value taken from the set of values assumed by $f$ in the training set. Finally, a random relation is drawn from the set $\{<, >, =\}$. In some of the experiments we used a reduced set of relations (this is specified in Section 7). Note that discrete variables can be considered as a special case of binary variables, since it is possible to convert them into binary ones through one-hot encoding.

Finally, the generated rules can be also combined using conjunctions. Specifically, 
given two or more rules, their product corresponds, from a logical point of view, to 
the conjunction of the conditions. In the remainder we will refer to the arity of this 


combination as the degree of the rule. 
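The procedure just described (random feature, threshold taken from the observed values,
random relation, optional conjunction) can be sketched as follows. This is only an
illustrative sketch: the names `random_rule`, `rule_holds` and `conjunction_holds`, as
well as the representation of a rule as a triple, are assumptions and not the interface
of the released implementation.

```python
import operator
import random

# The three relations mentioned in the text.
RELATIONS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def random_rule(X, rng, relations=("<", ">", "=")):
    """Draw a rule (feature index, relation, threshold) from the training matrix X.

    The threshold is one of the values that the chosen feature takes in X.
    """
    f = rng.randrange(len(X[0]))
    threshold = rng.choice([row[f] for row in X])
    return (f, rng.choice(relations), threshold)


def rule_holds(rule, x):
    """Truth value of a single rule on instance x."""
    f, relation, threshold = rule
    return RELATIONS[relation](x[f], threshold)


def conjunction_holds(rules, x):
    """A rule of degree len(rules): the logical AND of the single conditions."""
    return all(rule_holds(rule, x) for rule in rules)


X = [[1.0, 7.0], [2.0, 8.0], [3.0, 9.0]]
rng = random.Random(0)
degree_2_rule = [random_rule(X, rng), random_rule(X, rng)]
features = [int(conjunction_holds(degree_2_rule, x)) for x in X]
```

Passing a reduced tuple such as `relations=("<", ">")` reproduces the restricted
relation sets used in some of the experiments.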
Figure 3: Simple example of a random forest with three decision trees: light blue circles are the decision
nodes, while the small red circles are the end nodes. In the figure it is assumed that left branches mean that
the rules in the decision nodes are satisfied. Each leaf represents a possible decision rule that the feature
generator can extract. In particular, the depicted forest allows 10 possible rules to be extracted. However,
there is a pair of rules that are actually the same: (4) = (8), i.e., $(f_2 > t_2) \wedge (f_3 < t_3)$.


6.3. Decision path rules generation 


Generating rules as described in Section 6.2 may not be optimal, since the rules are
generated in a completely random fashion. The decision path rules generator tries to
overcome the limitation of the previous method by taking advantage of the nature of
the decision paths in decision trees (DTs). DT paths are based on a split criterion that
is usually defined in terms of some entropic index.

At classification time a DT takes the instance and traverses the tree according to
the value of the features in the split nodes. When an example reaches a leaf, it means
that it has satisfied all the rules along the decision path. The decision path rule
generator, given a random forest, picks at random one of the possible decision paths (i.e.,
a conjunction of relations) from a randomly picked tree of the random forest.

Figure 3 shows an example of a random forest composed of three decision trees.
The total number of leaves (end nodes) corresponds to the total number of possible decision
paths. However, some paths represent the same rule, e.g., (4) = (8), i.e.,
$(f_2 > t_2) \wedge (f_3 < t_3)$. In PRL, once the set of all possible decision paths is
extracted, the decision path rule generator randomly picks one path and generates the
corresponding rule.
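As a minimal sketch of the path extraction, the snippet below enumerates the
root-to-leaf paths of a toy forest and collapses duplicated rules. The tuple-based tree
encoding, the condition strings and the function names are hypothetical and chosen only
for illustration; a real implementation would read the paths from a trained random
forest.

```python
import random

LEAF = None  # end node of a tree


def decision_paths(node, prefix=()):
    """Yield every root-to-leaf path as a frozenset of (condition, satisfied) pairs."""
    if node is LEAF:
        yield frozenset(prefix)
        return
    condition, left, right = node
    # Convention of Figure 3: the left branch means that the condition is satisfied.
    yield from decision_paths(left, prefix + ((condition, True),))
    yield from decision_paths(right, prefix + ((condition, False),))


def forest_rules(forest):
    """Distinct rules over all trees; identical conjunctions collapse into one."""
    return {path for tree in forest for path in decision_paths(tree)}


def random_path_rule(forest, rng):
    """Pick one of the distinct decision-path rules at random."""
    rules = sorted(forest_rules(forest), key=sorted)  # fixed order for reproducibility
    return rng.choice(rules)


# Two toy trees that test the same two conditions in a different order.
tree_a = ("f2 > t2", ("f3 < t3", LEAF, LEAF), LEAF)
tree_b = ("f3 < t3", ("f2 > t2", LEAF, LEAF), LEAF)
forest = [tree_a, tree_b]

rule = random_path_rule(forest, random.Random(0))
```

In this toy forest the six leaves yield only five distinct rules, because the path
satisfying both conditions appears once in each tree, mirroring the duplicated rule
noted in the caption of Figure 3.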
6.4. Bias feature

As we will see in the experimental section, whenever a class can be defined by
reasonably simple rules, PRL is capable of identifying them all. However, when
there are classes which cannot be characterized by rules (or the rules are too complex),
PRL may fail to extract reasonable explanations. This is because, when
a class A is logically defined as the negation of another class B, by design PRL still
searches for features that characterize $A \succ B$ even though A cannot be defined with a
reasonable number of rules.

Let us make a toy example using the tic-tac-toe game to explain this concept. The
task is to classify whether a tic-tac-toe configuration is a win for the cross (x) or not.
It is quite simple to define when there is a win: whenever there are three
crosses in a line. However, how can a configuration that is not a win be defined? The
easiest way is by rejecting all the winning configurations. The only alternative is to
describe analytically each non-winning configuration for the cross, which is
not convenient. Unfortunately, in this scenario the best PRL can achieve is to identify
a set of rules that discriminate only small subsets of non-winning training
instances, with rather poor generalization capabilities. In some sense we can say
that PRL is overfitting.

To address this issue, we introduce an artificial feature (i.e., rule) that is set to be
true for every preference. That is, each example in the training set has the feature $\top$
whose value is equal to 1. Then, we allow the feature generation mechanism to pick the
$\top$ rule together with the other rules. We will call it the bias rule/feature. When such a
feature is selected and associated with a preference, it gives a bias towards the preferred
class. Clearly, the bias feature per se has no discriminative capabilities, thus if a class
can be characterized with a small subset of rules PRL will still be able to do it. However,
in cases similar to the tic-tac-toe example, where a class is simply the negation of the
other, the bias rule will play a key role. All examples of such a class trivially satisfy
the bias rule and hence its weight will be reasonably high. This can be interpreted as
“label A is preferred to label B because there is no evidence to say the opposite”.

The bias feature has been used in all experiments concerning the poker dataset
and also in some experiments on tic-tac-toe.
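The role of the bias rule can be made concrete with a very small sketch. Rules are
modelled here as plain Python predicates, and the names `bias_rule` and `rule_features`
are hypothetical; how PRL weights the resulting features is not shown.

```python
def bias_rule(x):
    """Artificial rule that is satisfied by every example."""
    return True


def rule_features(x, rules):
    """Binary feature vector: one entry per rule, 1 if the rule fires on x."""
    return [int(rule(x)) for rule in rules]


# Two ordinary (illustrative) rules plus the bias rule.
rules = [lambda x: x[0] > 5, lambda x: x[1] == 1, bias_rule]

examples = [[6, 0], [1, 1], [0, 0]]
matrix = [rule_features(x, rules) for x in examples]

# The bias column is constant, hence it carries no discriminative information;
# it can only shift the decision towards the class it is associated with.
assert all(row[-1] == 1 for row in matrix)
```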
7. Evaluation

In this section, we describe the empirical evaluation of PRL and the assessment
of its effectiveness. Specifically, our experiments focused on three main aspects
associated with PRL: evaluating the degree of interpretability of the proposed model, the
possibility of visualizing the model decisions, and assessing the performance of the
feature selection.

In all the experiments the number of iterations Te of Parallel FictPlay has been set 
to 10° (8 parallel executions of FictPlay have been run), while the number T of epochs 
of PRL has been set to 10°. All experiments have been performed using dynamic 
budget PRL with initial budget B = 500. We group the set of experiments on the basis 
of their purposes. The first set aims to assess the degree of interpretability of PRL as 
well as its effectiveness on some benchmark datasets. The second set of experiments, 
instead, focuses on the evaluation of the performance on datasets with a huge number 


of features. 


7.1. Model interpretation 


In the first set of experiments, we employed PRL to select the most relevant features
for interpreting the decisions. We ran PRL on four benchmark datasets. The details of
the datasets are summarized in Table 1.

Dataset         #Instances   #Features    #Classes
tic-tac-toe     958          27           2
breast-cancer   682          9            2
poker           25010        52/69/74*    10**
mnist           10000        784          10

Table 1: Datasets information: name, number of instances, number of features, and number of classes. All the
datasets are freely available in the UCI repository. (*) The poker dataset has 3 versions with different numbers
of features. (**) The original dataset has 10 classes; however, in our experiments three binary classification
tasks have been defined.

Implementation available at https://github.com/makgyver/PRL
7.1.1. The tic-tac-toe dataset

As a general test bed, we selected the tic-tac-toe dataset, in which each
example describes a possible final configuration of the tic-tac-toe game, and examples are
labeled as positive iff the x player is the winner. The dataset has been converted into
a binary-valued dataset through one-hot encoding, obtaining 27 binary input variables
for each instance. The 27 features represent the state of each position on the board, in which
each cell can have a cross (x), a nought (o) or can be empty. Each cell is encoded in
three consecutive binary features $x_i, x_{i+1}, x_{i+2}$, where $x_i$ means empty, $x_{i+1}$ means
nought and $x_{i+2}$ means cross, for $i = 3n$ with $n \in [0, \dots, 8]$. Note that with this
encoding the positive class can always be expressed as a single DNF rule which describes
all the eight possible 3-cross-in-a-line configurations, and negative otherwise.
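The encoding and the DNF rule can be sketched in a few lines. The numbering of the
cells row by row from the top-left corner and the characters 'b', 'o', 'x' for empty,
nought and cross are assumptions of this sketch, as are the function names.

```python
EMPTY, NOUGHT, CROSS = 0, 1, 2  # offsets inside each cell's block of three features

# The eight three-in-a-line cell triples (cells numbered 0..8 row by row).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals


def encode(board):
    """One-hot encode a 9-character board into 27 binary features."""
    offset = {"b": EMPTY, "o": NOUGHT, "x": CROSS}
    features = [0] * 27
    for n, cell in enumerate(board):
        features[3 * n + offset[cell]] = 1
    return features


def cross_wins(features):
    """DNF rule: OR over the eight lines of the AND of the three cross features."""
    return any(all(features[3 * n + CROSS] for n in line) for line in LINES)


win = encode("xxxoobbbb")    # crosses fill the top row
no_win = encode("xoxxoooxx")  # full board without three crosses in a line
assert cross_wins(win) and not cross_wins(no_win)
```

With this indexing the cross features are $x_2, x_5, \dots, x_{26}$, so each conjunct of
the DNF is a degree-3 monomial such as $x_2 x_5 x_8$ (top row).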

A winning position is characterized by the simultaneous activation of three specific
features (either columns, diagonals or rows), thus we can use such a priori knowledge
to select an appropriate feature space. For example, polynomial features are suited
for this purpose because they correspond to conjunctions when the input features are
binary. Thus, we used the polynomial features generator of degree 3. The experimental
setup was the following: 70% of the dataset has been used as training set, while the
remaining 30% was used as test set.

After the training phase, the top 10 features that PRL weighted the most were the
following: $x_8 x_{17} x_{26}$, $x_2 x_{11} x_{20}$, $x_2 x_{14} x_{26}$, $x_8 x_{14} x_{20}$,
$x_{20} x_{23} x_{26}$, $x_{11} x_{14} x_{17}$, $x_2 x_5 x_8$, $x_5 x_{14} x_{23}$,
$\tilde{x}_{13}^3$ and $\tilde{x}_{25}^3$. Features marked with ~ are the ones characterizing a negative
preference (no win for x). Observe how the first 8 features correspond correctly to the
winning configurations (three-in-a-line) for the cross. The remaining features represent
respectively a nought in the central and in the bottom right cell. The last one does not
seem to be particularly informative. Conversely, the former suggests negative evidence
that cross won, since occupying the centre is often a good strategy. In fact, a nought
in the centre is correctly associated with a negative preference (if nought occupies the
centre, it is less likely that cross has won). This highlights the strong interpretability
of the model. Overall, the algorithm has been able to identify all the conditions that
determine a win for the cross.
Moreover, using the tic-tac-toe dataset, we provide a further investigation
on the bias feature: we ran two different versions of PRL with the degree 3 polynomial
features generator, one including the bias feature and one without it. Figure 4 shows
the weights associated with the rules. Both versions correctly identify as strongest rules
the three-in-a-line for the cross (first 8 rules). As 9th feature, we found either the bias
feature (in the version of PRL that includes it) or, as previously described, a nought
in the centre of the grid. This is particularly evident considering the plots in Figure 4.
Without the bias, negative rules, which can describe a losing configuration for cross,
are more and more relevant (higher tail for the red line). When the bias feature is
available, the weight shared among rules for non-winning configurations is aggregated
onto the bias feature. Additionally, rules that describe winning situations are weighted
more. The bias feature expresses a single rule that says “if the configuration is not
winning for cross, then it is either losing or a tie”, without specifying what a
non-winning configuration looks like.

Other than using polynomials of degree 3, we tried to extract logical rules as
proposed in Section 6.2. Polynomial features correspond to conjunctions of positive
literals, while through the rule generation scheme we can also encode negative literals.
This allows the introduction of new rules, such as $\neg x_1 \wedge \neg x_{13} \wedge \neg x_{25}$, to which PRL
assigned a high weight. In fact, this rule expresses, in a human-readable fashion, the
concept that there exist some winning configurations with no nought on the diagonal. It is
easy to demonstrate that the suggested rule (although not intuitive for a human being)
correctly describes a sufficient (yet not necessary) condition to define a winning
configuration for cross. If there is no nought on the diagonal, then it cannot be a winning
configuration for nought, since every line of three cells intersects that diagonal.
Moreover, it cannot be a tie, since any tie requires the entire
grid to be filled, but if no nought is on the diagonal, then the diagonal is occupied by
crosses, and thus it must be a winning grid for cross.
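The sufficiency of this rule can also be checked exhaustively. The sketch below
enumerates all final configurations reachable when cross moves first (the convention of
the UCI dataset) and verifies that every final board without a nought on the main
diagonal is a win for cross. The board representation as a 9-character string and all
function names are assumptions made for illustration.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'x' or 'o' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards():
    """All final configurations reachable when x moves first."""
    seen, finals = set(), set()
    stack = [("b" * 9, "x")]
    while stack:
        board, player = stack.pop()
        if board in seen:
            continue
        seen.add(board)
        if winner(board) is not None or "b" not in board:
            finals.add(board)  # the game stops at a win or at a full grid
            continue
        next_player = "o" if player == "x" else "x"
        for i, cell in enumerate(board):
            if cell == "b":
                stack.append((board[:i] + player + board[i + 1:], next_player))
    return finals


def no_nought_on_diagonal(board):
    """The rule: no nought in the top-left, central and bottom-right cells."""
    return all(board[c] != "o" for c in (0, 4, 8))


finals = terminal_boards()
covered = [b for b in finals if no_nought_on_diagonal(b)]
# Sufficient condition: every covered final board is a win for cross.
assert all(winner(b) == "x" for b in covered)
```

The condition is not necessary: there are winning boards for cross, e.g. a win on the
top row with a nought in the centre, that the rule does not cover.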


7.1.2. The poker dataset 
The poker dataset consists of a set of examples representing poker hands. Each
hand is composed of 5 cards taken from a standard poker deck of 52 cards, with 4 suits
and 13 ranks (Ace to King) for each suit. The original task of this dataset is to identify
the value of the hand. The 10 possible hand values are: Nothing, Pair, Double pairs,
Three of a Kind (TOK), Straight, Flush, Full house, Four of a kind (FOK), Straight
flush and Royal flush.

Figure 4: Rule weights of the most relevant rules extracted by PRL on tic-tac-toe using the polynomial
feature generator of degree 3 with (blue) and without (red) the bias feature.

The experiments performed on the poker dataset were intended to assess the
quality of the features selected by PRL to explain the decision. To this end, we have created
a hierarchy of features with the aim of investigating whether PRL could identify, at each
level, the best subset of features/rules useful to accomplish the task. The hierarchy of
features was defined as follows:


• Level 1: The first representation of the dataset describes a poker hand as a
  trivial enumeration of the cards contained in it. The hand is described through a
  vector of dimension 52, where each dimension corresponds to a specific card
  and has value equal to 1 or 0, depending on whether the card is present in the hand or not:

  $[A\heartsuit, 2\heartsuit, \dots, A\diamondsuit, 2\diamondsuit, \dots, J\spadesuit, Q\spadesuit, K\spadesuit] \in \{0,1\}^{52}$.

• Level 2: The second level considers aggregated features obtained by counting
  either the suits or the ranks of the cards in the hand. Specifically, this new level
  is obtained by adding 4 new dimensions to the previous 52, which describe the
  counting of the suits, and 13 additional dimensions that describe the counting of
  the ranks, i.e., $[\#\heartsuit, \#\diamondsuit, \#\clubsuit, \#\spadesuit] \in [0,5]^{4}$ and
  $[\#A, \#2, \#3, \dots, \#Q, \#K] \in [0,4]^{13}$.
  Note that the Ace is assumed of rank 1, while J=11, Q=12, and K=13.

• Level 3: In this level we create 5 features that are a further aggregation of the
  features in Level 2. Specifically:

  - $\max\{\#\heartsuit, \dots, \#\spadesuit\}$: number of cards of the most popular suit in the hand;
  - $\max\{\#A, \dots, \#K\}$: number of cards of the most popular rank in the hand;
  - $\mathrm{card}(\{\heartsuit, \dots, \spadesuit\})$: number of different suits in the hand;
  - $\mathrm{card}(\{A, \dots, K\})$: number of different ranks in the hand;
  - max(diff(ranks)): the largest rank difference between two cards in the
    hand. In this case we associate with the Ace the rank 1 or 14, whichever minimizes
    the maximum difference between the other cards.


Let us make an example to clarify this feature hierarchy. Given the hand
$A\heartsuit, 7\heartsuit, A\spadesuit, J\spadesuit, 3\clubsuit$, at the first level of the hierarchy the 5 entries
out of 52 that correspond to the cards in the hand are equal to 1 and all the rest are zero.
At the second level, the suit vector is [2, 0, 1, 2], while the rank vector is
[2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]. These vectors are appended to the previous one,
giving a 69-dimensional vector. At the third and final level the feature values are the following:

• $\max\{\#\heartsuit, \dots, \#\spadesuit\} = \#\heartsuit = \#\spadesuit = 2$;
• $\max\{\#A, \dots, \#K\} = \#A = 2$;
• $\mathrm{card}(\{\heartsuit, \dots, \spadesuit\}) = \mathrm{card}(\{\heartsuit, \spadesuit, \clubsuit\}) = 3$;
• $\mathrm{card}(\{A, \dots, K\}) = \mathrm{card}(\{A, 3, 7, J\}) = 4$;
• max(diff(ranks)) = J - A = 10. In this case the Ace is associated with the value
  1 since J - 1 < 14 - 3,

which produces the vector [2, 2, 3, 4, 10]. Thus, in this last level the total number of
features is 74.
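The three levels can be reproduced with a short sketch. The representation of a card as
a (rank, suit) pair, the suit letters and the function names are assumptions made for
illustration; the suit order (hearts, diamonds, clubs, spades) follows the example above.

```python
SUITS = ["H", "D", "C", "S"]   # hearts, diamonds, clubs, spades
RANKS = list(range(1, 14))     # Ace = 1, ..., J = 11, Q = 12, K = 13


def level1(hand):
    """52-dimensional indicator vector of the cards in the hand."""
    v = [0] * 52
    for rank, suit in hand:
        v[13 * SUITS.index(suit) + (rank - 1)] = 1
    return v


def level2(hand):
    """Suit counts (4 values) and rank counts (13 values)."""
    suit_counts = [sum(s == suit for _, s in hand) for suit in SUITS]
    rank_counts = [sum(r == rank for r, _ in hand) for rank in RANKS]
    return suit_counts, rank_counts


def max_rank_diff(hand):
    """Largest rank difference, with the Ace counted as 1 or 14, whichever is smaller."""
    ranks = {r for r, _ in hand}
    ace_low = max(ranks) - min(ranks)
    ace_high_ranks = {14 if r == 1 else r for r in ranks}
    ace_high = max(ace_high_ranks) - min(ace_high_ranks)
    return min(ace_low, ace_high)


def level3(hand):
    """The five aggregated features of the third level."""
    suit_counts, rank_counts = level2(hand)
    return [max(suit_counts), max(rank_counts),
            sum(c > 0 for c in suit_counts), sum(c > 0 for c in rank_counts),
            max_rank_diff(hand)]


def full_representation(hand):
    """Concatenation of the three levels (74 features)."""
    suit_counts, rank_counts = level2(hand)
    return level1(hand) + suit_counts + rank_counts + level3(hand)


# The hand of the example: ace and 7 of hearts, ace and jack of spades, 3 of clubs.
hand = [(1, "H"), (7, "H"), (1, "S"), (11, "S"), (3, "C")]
assert level3(hand) == [2, 2, 3, 4, 10]
```

With this indexing the ace of diamonds sits at position 13 of the first-level vector,
i.e., it is the fourteenth feature.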
We defined three binary classification tasks: TOK (2.05% of the dataset) versus
rest, Flush (0.22% of the dataset) versus rest, and Straight (0.37% of the dataset) versus
rest. It is worth noticing that these combinations are included in more valuable hands,
for example: TOK is also part of the Full house and the FOK. However, a TOK is not
a FOK or a Full house. This fact can cause false positives at classification time.

Alongside evaluating the extracted rules, we assessed the quality of our model using
balanced accuracy. Balanced accuracy has been chosen over the more standard
accuracy measure due to the strong imbalance between the positive class and the negative
one. The balanced accuracy is defined as:


$$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right),$$

where TP stands for true positives (P = positives), and TN for true negatives (N =
negatives).
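A minimal sketch of the measure, assuming binary labels encoded as 1 (positive) and 0
(negative) and at least one example per class, shows why it is preferable to plain
accuracy on strongly imbalanced tasks. The function name is hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """BACC = (TP / P + TN / N) / 2 for binary labels in {0, 1}."""
    p = sum(1 for t in y_true if t == 1)
    n = len(y_true) - p
    tp = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 1)
    tn = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 0)
    return 0.5 * (tp / p + tn / n)


# One positive out of 100 examples and a classifier that always predicts "negative":
y_true = [1] + [0] * 99
y_pred = [0] * 100
accuracy = sum(t == y for t, y in zip(y_true, y_pred)) / len(y_true)

assert accuracy == 0.99                           # plain accuracy looks excellent
assert balanced_accuracy(y_true, y_pred) == 0.5   # BACC reveals chance-level behaviour
```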

PRL has been compared against SVM with polynomial and RBF kernels. In partic- 
ular, SVM has been validated via 5-fold cross validation so that the C hyper-parameter 
was validated among the set {10~3,..., 104}, the degree of the polynomial kernel in 
the range [1,3], and the shape parameter y of the RBF kernel has been validated in 
the set {10~?,..., 107, (# features x var(X))~1}. PRL has been trained using rules 
of degree 1 on the set of relations {=}. A possible example of a rule is the
following: $x_{13} = 1$, which corresponds to stating that the fourteenth feature, i.e., the ace
of diamonds, is set to 1, and thus available in the hand. Experiments have been
performed using a 70-30% training and test split. Moreover, our model has been
compared against the Random Forest Classifier, validated using 5-fold cross
validation, selecting the hyper-parameter associated with the number of estimators in the set
{10, 100, 1000, 5000}, the maximum depth in {2, 5, 10, until pure leaves} and the split
criterion in {Gini, entropy}.

The achieved results are reported in Table 2. With the name PRL we refer to PRL
with the rule generation scheme, while PRL-RF means PRL with the decision path rule
generation scheme.

A remarkable difference w.r.t. the SVM can be noticed in the Straight classification,
which is the hardest task. SVM simply classifies the majority of instances as negative
due to the strong imbalance of the dataset. This behavioural pattern is repeated in
all the tasks (except for TOK) both at the second and at the first level of the hierarchy.
Concerning the third level, SVM had one false negative in the TOK task and one false
positive in the Flush, achieving a BACC of 99.99% and 96.43%, respectively.

Method   Level   TOK      Straight   Flush
SVM      1       79.20    53.00      59.38
RF       1       51.76    50.00      50.00
PRL      1       48.08    49.97      59.03
PRL-RF   1       61.42    51.48      50.00
SVM      2       85.42    56.04      59.43
RF       2       99.62    50.00      59.38
PRL      2       100.00   52.72      84.36
PRL-RF   2       98.10    52.72      84.36
SVM      3       99.99    96.43      96.85
RF       3       100.00   90.90      100.00
PRL      3       100.00   100.00     100.00
PRL-RF   3       100.00   100.00     100.00

Table 2: Balanced accuracy (%) on the poker dataset. The highest accuracies in all classification tasks, and
in all levels, are highlighted in bold.

As shown in Table 2, in the first task, i.e., three of a kind, the best performance at the
first level is achieved by SVM. We attribute this to the different kinds of feature
combinations considered. Rule-based algorithms like random forest or PRL have to enumerate the cards
appearing in the three of a kind: on this dataset, where each hand appears only once,
this approach is unable to generalize. On the other hand, SVM, in particular the version
based on a polynomial kernel of degree 2, tries to describe a three of a kind using
pairs of cards having the same rank. Since a pair of cards of the
same rank appears in two different versions of a three of a kind (depending
on the remaining card of the three), this mechanism is able, to some extent, to generalize
to previously unseen examples and thus achieves better results.
PRL-RF (the PRL version based on rules extracted from the trees of a random forest)
achieves the second best result: we conjecture that this is because this approach can
consider conjunctions of two cards (namely the pair of cards)
and thus behaves similarly to SVM. The first level of features is unable to explain either a
Straight hand or a Flush hand. PRL is fully able to gather rules to explain
a three of a kind using the second level of features, whilst SVM achieves the worst
results, probably relying mainly on level 1 features and exploiting level 2 features only
partially. Almost all algorithms perform poorly on the Straight task using
level 2 features. Specifically, a hand contains a TOK whenever one of the ranks has
cardinality equal to 3. The rule “cardinality 3 of a rank”, however, suffers from false
positives: a full house hand is a false positive for such a rule. Similarly, a flush
can be described with a suit of cardinality 5, but this also includes the straight flush and
the royal flush as false positives. In both these tasks, PRL found the correct rules.

Figure 5: Value of the game w.r.t. the iteration of PRL on the Straight (left) and Flush (right) task.
With the features contained in the third and final level of the hierarchy it is possible
to define all the considered concepts:

• TOK: # of ranks = 3, # of cards of the most popular rank = 3;
• Straight: # of ranks = 5, max difference between ranks = 4, # of suits ≠ 1;
• Flush: # of suits = 1, max difference between ranks ≠ 4.

At this level, PRL was able to identify all the correct rules, achieving a BACC of 100%
in all the tasks.
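The three definitions above translate directly into predicates over the level-3
features. In this sketch a hand is summarized by the vector
[max suit count, max rank count, # of suits, # of ranks, max rank difference], in the
order in which the level-3 features were introduced; the function names and the example
hands are hypothetical.

```python
def is_tok(v):
    """Three of a kind: 3 distinct ranks and 3 cards of the most popular rank."""
    _, max_rank, _, n_ranks, _ = v
    return n_ranks == 3 and max_rank == 3


def is_straight(v):
    """Straight: 5 distinct consecutive ranks, not all of the same suit."""
    _, _, n_suits, n_ranks, max_diff = v
    return n_ranks == 5 and max_diff == 4 and n_suits != 1


def is_flush(v):
    """Flush: a single suit, excluding consecutive ranks (straight or royal flush)."""
    _, _, n_suits, _, max_diff = v
    return n_suits == 1 and max_diff != 4


three_of_a_kind = [2, 3, 3, 3, 11]   # e.g. 7, 7, 7, 2, K over three suits
full_house      = [2, 3, 3, 2, 5]    # e.g. 7, 7, 7, 2, 2: only two distinct ranks
straight        = [2, 1, 3, 5, 4]    # e.g. 5, 6, 7, 8, 9 over three suits
straight_flush  = [5, 1, 1, 5, 4]    # same ranks, single suit
flush           = [5, 1, 1, 5, 9]    # e.g. 2, 4, 7, 9, J of one suit

assert is_tok(three_of_a_kind) and not is_tok(full_house)
assert is_straight(straight) and not is_straight(straight_flush)
assert is_flush(flush) and not is_flush(straight_flush)
```

The full house and the straight flush are exactly the false positives that the simpler
level-2 rules cannot exclude.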


We have already demonstrated that the value of the game monotonically increases
at any iteration of PRL. This fact is further confirmed by the plots in Figure 5. The
plots show how the value of the game changes until iteration 50. In both cases the
maximum value has been reached, since PRL had already discovered the best rules.

In particular, Figure 5 (left) concerns the Straight classification, and it can be noticed that
there have been 3 quick changes in the value: at iterations 16 (red dashed line), 39
(green dashed line) and 46 (grey dashed line). Until iteration 16 the set of rules was
too immature to make any analysis. At iteration 16, the 8 best rules were the following:
max(diff(ranks)) = 5, 6, ..., 11, to support the negative class. These rules represent
almost all (12 is missing) the possible maximum differences for the ranks greater than
4, which is the value that identifies the Straight. The only rule for the positive
class was # ranks = 5, which is correct since in a Straight all ranks are different.

Then, at iteration 39, PRL found: max(diff(ranks)) = 5, 6, ..., 12 for the negative
class and # ranks = 5 for the positive one. So the only difference is the inclusion of the
12 in the maximum difference between ranks. Even though the overall rule is correct,
it is still not optimal because, instead of expressing the positive rule max(diff(ranks)) = 4,
PRL expressed the same concept using a set of negative rules.

Finally, at iteration 46 the right set of rules has been found, solving the task
perfectly, that is, max(diff(ranks)) = 4 and # ranks = 5 for the positive class, and # suits
= 1 for the negative one, which excludes the two cases in which the Straight is also a
Flush or a Royal flush. Here, the bias rule has also been extracted to represent all the
other cases when the two rules for the flush are not satisfied.

On the Flush task, the value of the game behaved similarly to the Straight task.
In this case the value drastically jumped only twice: at iteration 8 (red dashed line) and
14 (green dashed line). At iteration 8 the discovered rules, despite being correct, were
a bit “chaotic”: max suit = 2, 3 and # of suits = 2, 3 for the negative class, as well as
max(diff(ranks)) = 4 (to exclude a straight) and $\#\heartsuit, \#\diamondsuit, \#\clubsuit, \#\spadesuit = 1$. These rules
all correctly exclude a flush, which requires a suit with 5 cards and hence a
unique suit in the hand. For the positive class the only rule extracted was max rank =
1, which is also correct because a flush implies that all ranks in the hand are unique.
Nevertheless, some iterations later (14) PRL found the optimal set of rules, that is,
max(diff(ranks)) = 4 for the negative class, and max suit = 5 for the positive. Similarly
to the Straight case, the bias rule has been extracted by PRL to include all the cases in
which the hand does not contain a straight.
7.1.3. The breast-cancer dataset

The Wisconsin Breast Cancer dataset (breast-cancer) is a standard UCI
dataset that contains the values of 682 hospital patients captured via a fine-needle aspiration
test. Each patient is described by 9 attributes concerning breast tumoral cells. The task
consists in classifying a tumor as benign or malignant. The class distribution
is 65% benign and 35% malignant.

Compared to other testbeds, breast-cancer can be considered a real-world
dataset in which simple rules are unable to completely describe whether the tumor is
malignant or not. Thus, it is not possible to compare the retrieved rules with a given
ground truth. To evaluate PRL, it was therefore necessary to compare it with other
rule extraction algorithms, as proposed in [24]. In particular, the quality of our
approach was assessed by applying the retrieved rules directly on the dataset and comparing
the accuracy obtained by the different approaches. Note that, differently from previous
experiments, in this case the split between training and test set was 90-10%. This
evaluation procedure has been chosen because, for each model, we only had at our
disposal the set of extracted rules after a 10-fold cross validation procedure. To train PRL,
rules of degree 2 on the set of relations {<, >} have been used. Table 3 summarizes the
achieved results, while Table 4 shows the extracted rules of each method.

Figure 6 highlights an interesting observation on PRL applied to breast-cancer.
The figure shows the performance achieved by PRL when only a subset of rules is
considered. The size of the subset of rules goes from 1 up to 50. The first three rules are
not enough to get good results: they are likely associated with statistical occurrences
in the dataset and are unable to capture insight into the data. This changes with the
fourth rule, which, alone, is able to achieve more than 92% accuracy. The first four
rules together are able to exceed 95% accuracy, while the remaining rules are used
to classify outliers and harder examples and produce an almost monotonic increase in
accuracy. Overall, the PRL approach is able to achieve 99.56% accuracy using 50
rules. Since a model with 50 rules is hardly interpretable, in Table 3 we present a
comparison of PRL with a strongly limited set of rules (up to 10) against other approaches.
As highlighted in Table 3, PRL is able to achieve the best accuracy using the 10 most
relevant rules. Observe that, even with a lower number of rules (i.e., 5), the results
achieved by PRL are still very good, exceeded only by C-MLP2LN and QSVM-G.

Method         Ref    # Rules   Accuracy (%)
SSV            [25]   3         86.36
GASVM          [26]   2         90.03
C-MLP2LN       [25]   5         96.92
QSVM-G         [27]   12        96.48
ReRXJ48               4         94.28
PRL only 4th   -      1         92.67
PRL@5          -      5         96.12
PRL@10         -      10        97.95

Table 3: Accuracy of the rules extracted by the different algorithms. The highest accuracy is highlighted in
bold.

Figure 6: Plot of the accuracy on breast-cancer w.r.t. the number of considered rules during the
classification.


7.2. Visualization: mnist dataset 


The mnist dataset is one of the most widely used datasets for the hand-written
digit classification task. The digits are stored in a grey scale 28 by 28 pixel matrix,
where each pixel can assume a value between 0 and 255 (0-1 normalized). The task is
to recognize the digit represented by an instance. For the purpose of these experiments
we perform all possible 1-vs-1 binary classifications between digits.

As with breast-cancer, the classification in mnist cannot be done through
simple rules. To present PRL in an interpretable fashion, we aim to show how the most
relevant visual features are leveraged by the model to distinguish examples belonging
to either a class or the other.

Method     Rules

C-MLP2LN   CT < 6 ∧ UCSH < 3 ∧ BC < 8
           CT < 9 ∧ MA < 4 ∧ BN < 2 ∧ BC < 5
           CT < 10 ∧ UCSH < 4 ∧ MA < 4 ∧ BN < 3
           CT < 7 ∧ UCSH < 9 ∧ MA < 3 ∧ 4 < BN < 9 ∧ BC < 4
           3 < CT < 4 ∧ UCSH < 9 ∧ MA < 10 ∧ BN < 6 ∧ BC < 8

SSV        MA > 2.5 ∧ BC > 2.5
           MA > 2.5 ∧ BN > 3.5 ∧ BC > 0
           UCSI > 5.5 ∧ MA < 2.5 ∧ BC > 1.6

GASVM      CT < 7.09 ∧ UCSH < 7.91 ∧ SECS < 9.76 ∧ BC < 6.06
           UCSH < 7.7 ∧ BN < 9.41 ∧ BC < 6.12 ∧ M < 7.43

QSVM-G     CT < 10 ∧ UCSH < 9.95 ∧ BN < 6.93
           CT < 7.00 ∧ UCSI < 5.97 ∧ SECS < 4.97 ∧ BN < 4.94 ∧ NN < 9.94
           UCSH > 2.97 ∧ BN > 4.94
           CT > 4.96 ∧ UCSI > 4.00
           UCSI > 4.98
           CT > 2.98 ∧ BN > 6.93
           CT > 5.98 ∧ UCSI > 3.00 ∧ UCSH > 3.99
           UCSH > 2.97 ∧ SECS > 4.97
           NN > 8.96
           UCSI < 3.00 ∧ UCSH > 3.99 ∧ SECS < 4.97
           CT < 2.98 ∧ UCSH < 4.95 ∧ BN > 9.95
           CT > 10.00 ∧ UCSI < 3.00 ∧ BN < 7.96

ReRXJ48    BN = 1
           CT < 4 ∧ 1 < BN < 6
           CT < 4 ∧ BN > 6
           CT > 4 ∧ BN > 1

PRL        MA > 2 ∧ UCSI > 5
           BN > 3 ∧ SECS < 4
           MA < 2 ∧ SECS > 3
           CT < 6 ∧ MA < 5
           NN < 2 ∧ SECS > 2
           SECS < 2 ∧ MA > 3
           BN > 6 ∧ BC > 4
           BN > 5 ∧ NN < 1
           UCSH < 3 ∧ BC > 4
           CT < 6 ∧ NN < 8

Table 4: Rules extracted by the rule extraction algorithms reported in (Hayashi and Nakano 2015) and by PRL.
The class column indicates whether the rule defines the positive class (M) or the negative class (B).

Figure 7: Visualization of the most relevant polynomial features of degree 2 for (a) 0 versus 9 and (b) 4
versus 6. The polynomial features are visualized as segments limited by the involved input variables. The
left-hand side plots show the features relevant to discriminate (a) the 0 from the 9 and (b) the 4 from the 6;
the right-hand side plots show the converse.

Experiments have been performed using polynomial features of degree 2. Figure 7
illustrates two examples of the most relevant features used by the model to distinguish:
(a) 0 from a 9 (left) and vice versa (right); (b) 4 from a 6 (left) and vice versa (right).

Each feature is presented using a segment between the two input variables (i.e., pixels)
that concur in the decision process in each rule (i.e., monomial). The background
represents the average digit of the depicted class.
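The mapping from a degree-2 monomial to the segment drawn in the plots can be sketched
as follows. The sketch assumes that the 784 input variables are the pixels of the 28 by
28 image flattened in row-major order; this ordering and the function names are
assumptions made for illustration.

```python
WIDTH = 28  # mnist images are 28 by 28 pixels


def pixel_coords(index, width=WIDTH):
    """Row and column of a flattened pixel index (row-major order assumed)."""
    return divmod(index, width)


def monomial_segment(i, j, width=WIDTH):
    """Endpoints (row, col) of the segment that depicts the monomial x_i * x_j."""
    return pixel_coords(i, width), pixel_coords(j, width)


# The monomial x_29 * x_57 links pixel (1, 1) with the pixel right below it, (2, 1).
assert monomial_segment(29, 57) == ((1, 1), (2, 1))
```

Each pair of endpoints can then be drawn over the average image of the class with any
plotting library.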

The plots presented in Figure 7(a) show how curvatures are used to distinguish the 0
from the 9. In particular, the algorithm looks for a “big” curvature to recognize
elements belonging to the 0 class, and a smaller one to identify the 9. Figure 7(b) depicts
a similar behaviour to distinguish 4 and 6. Again, the 6 is characterized by curvatures,
while the horizontal dash is considered to be the important aspect to recognize the 4.

Figure 8 shows how the value of the game has changed over the iterations of PRL,
while Figure 9 shows the full set of digit-vs-digit plots. The rows correspond to the
preferred class, while the columns correspond to the class that is not preferred w.r.t. the
class in the row.


7.3. Feature Selection 


This set of experiments aims at assessing the effectiveness of PRL on datasets with
many noisy and redundant features. The chosen testbeds have been the datasets of
the NIPS 2003 Feature selection challenge [29]. All datasets are freely available at the
NIPS 2003 Feature selection challenge site, http://clopinet.com/isabelle/Projects/NIPS2003/.
Further details about the datasets are reported in and [30]. A common characteristic of
these datasets is the huge number of features compared to the number of training
instances. All datasets consist of binary classification tasks. Table 5 summarizes the
characteristics of the used datasets.

Figure 8: Value of the game w.r.t. the iteration of PRL on the mnist dataset.

Dataset    #Inst.   #Feats.   # Real feat.   Class prior
dorothea   1150     100k      50k            90/10
gisette    7k       5k        250            50/50
madelon    2.6k     500       20             50/50

Table 5: Datasets information: name, number of instances, number of features, number of relevant features
(probes), and class prior.

Figure 9: Depiction of the relevant features extracted by PRL for each possible pair of digits.

We compared PRL with standard soft-margin SVM and Random Forest Classifier (RF).
Given the huge number of features of the target datasets, the linear kernel turned out to
be a good kernel for these tasks, with the exception of madelon, for which the degree 2
homogeneous polynomial was the best performing kernel for SVM. Moreover, we evaluated
PRL using the Decision path rules generator (dubbed PRL-RF). Experiments have been
performed using a 70-30% training and test split. The C hyper-parameter of the SVM has
been validated in the set of values {10~*,..., 10°} using a 5-fold cross validation
procedure. The Random Forest Classifier has been tuned by selecting the number of
estimators in the set {10, 10^2, 10^3, 5000}, the maximum depth in {2, 5, 10, until pure
leaves} and the criterion in {Gini, entropy}, again via 5-fold cross validation. Table 6
summarizes the results achieved by all methods, as well as the number of relevant
features according to PRL.
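
The tuning protocol described above (a 70-30% split followed by 5-fold cross validation
over a grid of hyper-parameters) can be sketched as follows. This is only an illustrative
outline and not the authors' code: the function names are hypothetical, the values 100
and 1000 in the Random Forest grid are an assumed reading of the text, and score_fn is a
placeholder for whatever routine trains a model on the fit indices and returns its
accuracy on the validation indices.

```python
import random
from itertools import product

# Random Forest grid as read from the text; 100 and 1000 are assumed values.
# None stands for "until pure leaves".
RF_GRID = {
    "n_estimators": [10, 100, 1000, 5000],
    "max_depth": [2, 5, 10, None],
    "criterion": ["gini", "entropy"],
}


def grid_configs(grid):
    """Enumerate every combination of hyper-parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def train_test_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=5):
    """Partition the indices into k folds and yield (fit, validation) pairs."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


def select_config(configs, train_idx, score_fn, k=5):
    """Return the configuration with the best mean validation score."""
    def mean_score(config):
        scores = [score_fn(config, fit, val) for fit, val in k_folds(train_idx, k)]
        return sum(scores) / len(scores)
    return max(configs, key=mean_score)
```

Under these assumptions the Random Forest grid contains 4 x 4 x 2 = 32 configurations,
each of which is scored on five validation folds drawn from the training part only; the
test part is left untouched until the final evaluation.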

As evident from the table, PRL achieves better accuracy than SVM on all three datasets.
It is worth mentioning that, generally, the number of features used by PRL is orders of
magnitude smaller than the number of original features. PRL-RF achieves the best
accuracy on every dataset, consistently outperforming both PRL and RF; the improvement
over PRL can be ascribed to the quality of the generated features. The same cannot be
said for PRL, which is better than RF on dorothea, comparable on gisette and worse on
madelon. The fact that PRL-RF consistently outperforms RF underlines the effectiveness
of the feature selection capability of PRL.




Dataset    SVM     RF      PRL     PRL-RF   # Relevant/Tot. feat.
dorothea   91.88   78.13   92.69   93.33    500/100k
gisette    96.71   97.22   97.19   97.67    900/5k
madelon    60.10   71.05   62.75   74.49    1225/250k

Table 6: Accuracy results achieved by SVM, RF, PRL and PRL-RF. The last column indicates the number
of support preference-feature pairs used by PRL. The best results are highlighted in bold.


8. Related work 


In this section we discuss work related to the proposed method. We give particular
attention to game theoretical concepts related to machine learning, especially to large
margin methods, and to on-line (non-linear) feature selection.


8.1. Game theory and machine learning 


Connections between large margin methods and game theory have already been
discussed in the literature. In [5], Couellan investigates connections between supervised
classification and generalized Nash equilibrium problems. Specifically, the geometrical
properties of the separation hyperplane of SVM in the dual space are exploited to
formulate a non-cooperative game. The intuition behind this game theoretical
formulation is that the two players are associated with the positive and the negative
class, respectively. The goal of each player is to “pull” the hyperplane close to itself. In
the paper, the proposed formulation is then extended to the multi-class setting. A similar
observation has also been made in earlier work, in which hard-margin SVM is cast into a
two-player zero-sum game. Starting from this observation, the authors propose a
kernel-based method for the direct optimization of the margin distribution.

In [31], Polato et al. propose a preference learning framework inspired by game
theory for multi-class classification problems. The framework defines a single
optimization problem related to the optimal strategies of a two-player zero-sum game. To
improve the efficiency, the authors propose an approximate solution which requires the
sequential optimization of many sub-games.
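
To make the notions of "optimal strategies" and "value" of a two-player zero-sum game
concrete, the sketch below approximates both for a small payoff matrix by letting the
two players run the multiplicative weights (Hedge) update against each other. It is a
generic textbook procedure shown for illustration only; it is not the incremental
algorithm of PRL nor the method of the works cited above, and the function names are
hypothetical. The row player maximizes the payoff, the column player minimizes it.

```python
import math


def _softmax(scores):
    """Numerically stable normalization of exponentiated scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def approximate_game_value(A, rounds=5000):
    """Approximate the mixed strategies and value of the matrix game A."""
    n_rows, n_cols = len(A), len(A[0])
    lo = min(min(row) for row in A)
    hi = max(max(row) for row in A)
    scale = (hi - lo) or 1.0
    # Rescale the payoffs to [0, 1], the range assumed by the step sizes below.
    G = [[(a - lo) / scale for a in row] for row in A]
    eta_p = math.sqrt(8.0 * math.log(n_rows) / rounds)
    eta_q = math.sqrt(8.0 * math.log(n_cols) / rounds)
    gains = [0.0] * n_rows   # cumulative expected gain of each row
    losses = [0.0] * n_cols  # cumulative expected loss of each column
    p_sum = [0.0] * n_rows
    q_sum = [0.0] * n_cols
    for _ in range(rounds):
        p = _softmax([eta_p * g for g in gains])
        q = _softmax([-eta_q * l for l in losses])
        for i in range(n_rows):
            gains[i] += sum(G[i][j] * q[j] for j in range(n_cols))
            p_sum[i] += p[i]
        for j in range(n_cols):
            losses[j] += sum(p[i] * G[i][j] for i in range(n_rows))
            q_sum[j] += q[j]
    p_avg = [s / rounds for s in p_sum]
    q_avg = [s / rounds for s in q_sum]
    # Payoff guaranteed to the row player by p_avg, and payoff that the
    # column player can enforce with q_avg; the value lies in between.
    lower = min(sum(p_avg[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols))
    upper = max(sum(A[i][j] * q_avg[j] for j in range(n_cols)) for i in range(n_rows))
    return p_avg, q_avg, lower, upper
```

For the payoff matrix [[2, -1], [-1, 1]] the exact value is 1/5, attained when the row
player picks the first row with probability 2/5. The returned pair (lower, upper)
brackets this value, and the bracket tightens as the number of rounds grows.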

In [32], a game theory approach for solving multi-class classification is presented.
In this work, pairwise classification is seen as a decision-making problem and the
authors show that pairwise SVM can be cast into the proposed GT framework. They also
prove that the solution of the proposed approach is equivalent to the fuzzy pairwise
SVM [33]. Game theory, specifically Shapley values, is also used as a surrogate model
for interpreting machine learning models. In this context, FAE (Formulate, Approximate,
Explain), a conceptual unified framework for generating and interpreting explanations,
has been proposed. FAE generalizes methods such as [36].

From an applicative perspective, game theoretical concepts are widely exploited
in the cyber-security community because they naturally model the adversarial nature of
an attacker, e.g., [37, 2].


8.2. On-line feature selection 


One of the first approaches for performing feature selection in the feature space
was proposed in [38]. In this work, Cao et al. extended Relief [39], a margin-based
feature selection approach, to the kernel space by deriving a basis set in the feature
space, which is used to compute the distances (needed to find the nearhit and the
nearmiss) in the feature space via the kernel trick.
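
As a reminder of why distances in the feature space are accessible through the kernel
trick, the squared distance between two mapped points can be written using kernel
evaluations only: ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z). The sketch below
uses this standard identity to find the nearhit and the nearmiss of an example. It only
illustrates the identity and is not the basis-set construction of Cao et al.; the
function names and the choice of kernel are assumptions.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def kernel_distance(x, z, kernel):
    """Distance between phi(x) and phi(z), computed without building phi."""
    squared = kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)
    return math.sqrt(max(squared, 0.0))  # clip tiny negatives due to rounding


def nearest_hit_and_miss(index, X, y, kernel):
    """Indices of the closest same-class and closest other-class examples."""
    hit, miss = None, None
    best_hit = best_miss = math.inf
    for j, (z, label) in enumerate(zip(X, y)):
        if j == index:
            continue
        d = kernel_distance(X[index], z, kernel)
        if label == y[index] and d < best_hit:
            hit, best_hit = j, d
        elif label != y[index] and d < best_miss:
            miss, best_miss = j, d
    return hit, miss
```

For the degree 2 homogeneous polynomial kernel in two dimensions, the explicit map is
phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so the identity can be checked by hand on small
inputs.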

In [40], Nguyen et al. proposed a convex energy-based framework to jointly perform
feature selection and non-linear SVM parameter learning. The authors empirically show
that the proposed method significantly reduces the number of features used while
maintaining classification performance.

More recently, a similar approach has been proposed by Adeli et al. for early
diagnosis of Parkinson’s disease. The core idea behind the proposed method is to learn
a separate kernel for each single feature, in the spirit of the Multiple Kernel Learning
framework. The optimization problem then learns how to weight these kernels, and the
weights represent how discriminant the corresponding features are in the
feature space.
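
The idea of one kernel per feature combined through learned weights can be illustrated
with a small sketch. It assumes an RBF base kernel on each scalar feature and takes the
weights as given; both choices, as well as the function names, are illustrative
assumptions and do not reproduce the optimization problem of Adeli et al. A non-negative
weighted sum of kernels is itself a valid kernel, which is what makes this kind of
combination possible.

```python
import math


def feature_kernel(a, b, gamma=1.0):
    """RBF kernel on a single scalar feature (an assumed base kernel)."""
    return math.exp(-gamma * (a - b) ** 2)


def combined_kernel(x, z, weights, gamma=1.0):
    """Non-negative weighted sum of one base kernel per feature."""
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    return sum(w * feature_kernel(a, b, gamma) for w, a, b in zip(weights, x, z))
```

A feature whose weight is zero has no influence on the combined similarity, so the
magnitude of each weight can be read as the relevance of the corresponding feature.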

The main difference between PRL and the approaches mentioned above is that, thanks
to the feature generator component, PRL is able to treat problems with infinitely many
features. PRL, like HOSFS [42], is also theoretically suitable for dealing with streams
of features; however, its main limitation is efficiency, which may not be ideal in
high-throughput settings.


One of the biggest challenges in feature selection is dealing with large scale data, in
particular with (infinitely) many input features. This is a typical scenario in real-world
applications when data instances have high dimensionality or it is expensive/inconvenient
to acquire all attributes. In these contexts, batch approaches are simply not applicable
for computational reasons. Thus, there is the need to move towards on-line feature
selection (OFS) approaches [43, 44], which can work with a small and limited number
of features. For a comprehensive comparison of linear and non-linear feature selection
methods we refer the reader to [45].


9. Conclusions and future work


This paper proposed a new preference learning approach for classification (and 
label ranking) based on game theoretical concepts. The learning problem is seen as 
a two-player zero-sum game solved by a novel incremental algorithm. We provided
theoretical guarantees about the convergence of the algorithm as well as an extensive 
set of experiments demonstrating its effectiveness. We also showed the capability of 
PRL in identifying explanation rules for interpreting the predictions. 

In the future we aim at applying PRL for extracting feature correlations, for example
by creating artificial tasks where a target feature is used as label and the extracted
rules describe how other features correlate with the target one. Moreover, we aim at
introducing new feature generation schemes. A possibility could be to explore random
feature generation methods such as the Rahimi and Recht random features [46].
Finally, we also intend to relax the PRL formulation in order to get a soft margin version
of the algorithm.